from __future__ import division, print_function
import pandas as pd
from utils import *
import matplotlib.pyplot as plt
%matplotlib inline
Among the ImageNet synsets, a few are related to watches and clocks:
One easy approach is to use an ImageNet pre-trained model with full weights to predict the 1000 ImageNet classes. Among these classes, five relate to watches/clocks (see above). We can use the sum of the probabilities of these classes as a confidence score, which is then between 0 and 1: the higher the score, the higher the confidence. Based on this score, we can also apply hard labeling, i.e. a Boolean output for whether there is a watch in the image. But how do we turn probabilities into binary outputs? We will fine-tune the threshold used to predict the Boolean values.
To validate this approach, we can simply split the data manually into two subfolders (watch or not). As the dataset is quite small (no more than 635 images), this only takes a few minutes. Once the predictions are done on the entire image set, we can evaluate both the probabilities and the labels with standard metrics: ROC AUC, F1-score and accuracy. Of course, the risk of introducing noise into the validation labels is high: only one person built the validation set (splitting the images into two classes very quickly), and the judgement can differ from one person to another.
Here we only use ResNet50 for convenience. Several other models could be combined in an ensemble for further improvement.
If we can get user feedback on a small amount of data, we can further improve the model by making it learn from its errors. One option is to integrate this feedback into a training set in order to retrain and fine-tune the model. The model would then get better over time by integrating the users' annotations of its predictions, which implies a progressive calibration of the model.
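As a lightweight sketch of this calibration loop, assuming for illustration that we only re-tune the decision threshold on the accumulated feedback (rather than retraining the full network), the idea could look like this (`best_threshold` is a hypothetical helper, not part of the project code):

```python
import numpy as np

def best_threshold(probas, labels, candidates):
    """Pick the candidate threshold maximizing accuracy on the
    accumulated (proba, corrected-label) feedback pairs."""
    probas = np.asarray(probas)
    labels = np.asarray(labels)
    accs = [np.mean((probas > th) == labels) for th in candidates]
    return candidates[int(np.argmax(accs))]

# hypothetical feedback: summed clock probabilities + user-corrected labels
feedback_probas = [0.005, 0.02, 0.3, 0.8]
feedback_labels = [0, 1, 1, 1]
th = best_threshold(feedback_probas, feedback_labels, [0.01, 0.1, 0.5])
```

Each new batch of feedback simply extends the two lists, so the threshold keeps adapting as annotations accumulate.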
path = './data'
bs_1 = 1
# batches of images
_batches = get_batches(path, batch_size=bs_1, shuffle=False)
Any of the models would do the job (VGG16, VGG19, ResNet, ...)
# model and imagenet probas
model = ResNet50(include_top=True, weights='imagenet')
probas_ = model.predict_generator(_batches, verbose=True)
We simply sum the probabilities of the 5 watch/clock imagenet synsets.
# adapt probas to our case
clock_ids = [409, 530, 531, 826, 892]
probas_clocks = probas_[:, clock_ids].sum(axis=1)  # sum of clock children synsets probas
We need to fine-tune the threshold for the labeling, based on the validation results. It appears that the optimal threshold is around 0.01.
# fine-tune the threshold for binary classification
y_test = _batches.classes  # validation labels
roc_auc = roc_auc_score(y_test, probas_clocks)  # threshold-independent
for th in [0.01, 0.05, 0.10, 0.20, 0.25, 0.30, 0.35, 0.4, 0.45, 0.5]:
    preds = [1 if el > th else 0 for el in probas_clocks]  # binary output
    f1 = f1_score(y_test, preds)
    acc = accuracy_score(y_test, preds)
    print("th: {} :: ROC AUC: {} -- F1: {} -- ACC: {}".format(th, roc_auc, f1, acc))
# saving model weights
model.save_weights('./app/models/weights_resnet50.h5')
We may be tempted to measure low-level degradations such as noise, blur, etc. For example, we can use the variance of the Laplacian of the image to assess its level of blurriness. Measuring low-level degradations would be the simplest approach.
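A minimal sketch of the variance-of-Laplacian blur measure (here with `scipy.ndimage` on a synthetic grayscale array; `blur_score` is an illustrative helper, not project code):

```python
import numpy as np
from scipy import ndimage

def blur_score(gray):
    """Variance of the Laplacian: lower values suggest a blurrier image,
    since blurring removes the high frequencies the Laplacian responds to."""
    return ndimage.laplace(gray.astype(float)).var()

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))                      # synthetic "sharp" image
blurred = ndimage.uniform_filter(sharp, size=5)   # box-blurred version
```

On this synthetic pair, `blur_score(blurred)` comes out lower than `blur_score(sharp)`, which is exactly the behavior we would exploit to flag blurry photos.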
However, it is also possible to use CNNs to automate image quality assessment. In recent years, deep learning and CNNs have proven efficient for this task as well. Below are some recent papers and approaches to test:
We can use the NIMA pre-trained model (https://github.com/titu1994/neural-image-assessment), trained on the AVA (Aesthetic Visual Analysis) dataset (N.B. available in Keras + TensorFlow). It predicts the distribution of human opinion scores for 'aesthetic quality'. Here we take the mean of the distribution to rank images by their 'human-like' aesthetic quality score: the lower the score, the lower the aesthetic quality is likely to be.
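NIMA outputs a probability distribution over the score bins 1 to 10; collapsing it to a mean is a one-liner. This is a sketch of what the `mean_score` helper in utils presumably computes, assuming the standard NIMA formulation:

```python
import numpy as np

def mean_score(dist):
    """Collapse a NIMA score distribution (probabilities over bins 1..10)
    into a single mean quality value."""
    dist = np.asarray(dist, dtype=float)
    return float(np.sum(dist * np.arange(1, 11)))
```

For example, a uniform distribution over the ten bins gives a mean of 5.5, while a distribution concentrated on the top bin gives 10.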
Please refer to the utils file for the respective functions.
As the results are continuous and tied to the annotations of the initial training set (i.e. AVA), validating the continuous values directly would be complicated.
Ranking the images according to their quality score:
We can rank randomly selected images and visually assess the coherence of the ranking (see plot below). The lowest-quality images are either strongly blurred or very small.
This is a naive, not strictly "proper", way to validate the model.
Trying to reproduce the annotation system of the initial experiment, using user feedback:
We can take full advantage of user feedback.
If annotators label the images following the rules of the initial experiment, we can use their feedback as ground truth to validate the model. The IQA model predicts a distribution of scores, and we output the mean score.
We can ask, say, 10 annotators to score the quality based on their human perception and the original rules, and compute the mean score per image as well. Comparing the two means would then be a solution; we could even directly compare the distributions of scores.
This is possible because the initial experiment is based on human scoring as well.
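A minimal sketch of this comparison, with hypothetical per-image means (the annotator scores below are invented for illustration; correlation is one simple agreement measure between the two sets of means):

```python
import numpy as np

# hypothetical per-image means: model (NIMA) vs. 10 human annotators
model_means = np.array([4.2, 6.8, 5.1, 7.4, 3.9])
human_means = np.array([4.5, 7.0, 4.8, 7.9, 3.5])

# Pearson correlation between the two mean scores: close to 1 means
# the model ranks images much like the annotators do
r = np.corrcoef(model_means, human_means)[0, 1]
```

Beyond the means, the full predicted and annotated score distributions could also be compared directly, as noted above.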
Predict image quality, label (i.e. watch or not) and the confidence score.
path_to_images = './images_mix'
# LOADING IMAGES
print('Loading images')
test_images, test_index = load_images(path_to_images)
print('Found', len(test_images), 'images to predict on')
# LOADING MODELS
print('Loading models')
iqa = IQA('./app/models/inception_resnet_weights.h5')
tfl = ResNet50(include_top=True, weights='./app/models/weights_resnet50.h5')
# PREDICT
print('Predicting probas & labels: is there a watch?')
th = 0.01
_probas = tfl.predict(test_images, batch_size=1, verbose=True)
probas_ = synsets2clocks(_probas)
labels_ = [el > th for el in probas_]
print('Predicting Image Quality scores')
# preprocessing adapted to InceptionResNetV2
test_images = 2*(test_images/255.0)-1.0
# mean scores
scores = iqa.predict(test_images, batch_size=1, verbose=1)
scores_ = [mean_score(el) for el in scores]
out_dict = dict()
for idx, el in enumerate(test_index):
    out_dict[el] = dict()
    out_dict[el]['quality_score'] = scores_[idx]
    out_dict[el]['watch_pred'] = (labels_[idx], probas_[idx])
df = pd.DataFrame.from_dict(out_dict, orient='index').sort_values(by='quality_score',ascending=False)
df.head()
The ranking looks coherent at least for the extreme cases.
def plot_demo(data):
    """Plot demo images with their respective predictions."""
    fig, ax = plt.subplots(ncols=5, nrows=4, sharex=True, sharey=True, figsize=(20, 15))
    ax = ax.ravel()
    for idx, im_p in enumerate(data.index):
        im = imread(im_p)
        ax[idx].imshow(im)
        ax[idx].set_title('quality: {:.2f} - watch: {} | {:.2f}'.format(data.iloc[idx]['quality_score'],
                                                                        data.iloc[idx]['watch_pred'][0],
                                                                        data.iloc[idx]['watch_pred'][1]))
        ax[idx].axis('off')
    plt.tight_layout()
# top 20
plot_demo(df.head(20))
# last 20
plot_demo(df.tail(20))